Relation Information Extraction Using Deep Syntactic Analysis

Authors

  • Akane Yakushiji
  • Jun-ichi TSUJII
Abstract

There is a growing need to apply natural language processing technology to Information Extraction (IE), for example to extract relations between entities, which are more informative than documents merely retrieved by keyword search. This dissertation proposes a novel method for constructing and using extraction patterns for relation extraction, based on deep syntactic relations obtained by full parsing. The step that requires the most manual work when building an IE system is constructing the extraction patterns that pull target information out of source texts, because the same information can be expressed through many syntactic variations. To reduce this manual work, our approach has two phases. First, we increase the expressive power of extraction patterns and reduce the number of patterns needed by normalizing syntactic variations into predicate-argument structures (PASs), using a full parser based on Head-driven Phrase Structure Grammar (HPSG). Then, we take the PASs that connect the target entities in a small training corpus as extraction patterns, divide them into components, and use combinations of those components for generalization. As a real-world application, we constructed an IE system for protein-protein interactions, which are important knowledge in biomedical research. We evaluated the system on a small test-case corpus and on a large real-world corpus, and show its effectiveness.

This dissertation also describes aspects that must be considered to ensure that full parsers are effective for domain-specific IE. The first aspect is whether the deep syntactic relations obtained by parsing can capture the syntactic information needed to construct extraction patterns. To show that full parsing is sufficiently accurate on biomedical text, we evaluated the precision of primitive PASs obtained from a biomedical text by an HPSG parser. To compare PAS patterns with part-of-speech patterns, we also evaluated verb-argument relations extracted from a biomedical text by each kind of pattern. The second aspect is the difficulty of applying general-purpose parsers to specific domains. To measure the domain-specific coverage of a general-purpose HPSG grammar, we investigated the deficiencies of the grammar when parsing a biomedical text. We also report a preliminary investigation of general-purpose parsers, which suggests that parsing accuracy on a general corpus does not ensure parsing accuracy or IE accuracy on domain-specific text.

Across the results in this dissertation, we show that full parsing is effective for IE. To further improve domain-specific IE with full parsing, shallow information in sentences, such as surface words, should be used in combination with full parsing results. It is also necessary to develop full parsers with consideration not only of general-purpose corpora but also of domain-specific text.
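Thesis abstract (Japanese version)

In recent years there has been demand to use natural language processing technology not only for keyword-based document retrieval but also for information extraction that retrieves relations between entities, which are more informative. This thesis proposes a new method for constructing and using extraction patterns for relation extraction, based on the deep syntactic relations obtained by deep parsing. When building an information extraction system, constructing the extraction patterns requires the most manual work, because many syntactic variations can express the same information. To reduce this manual work, the thesis takes a two-phase approach. First, a deep parser based on Head-driven Phrase Structure Grammar (HPSG) normalizes syntactic variations into predicate-argument structures, which increases the expressive power of extraction patterns and reduces the number of patterns needed. Next, predicate-argument structures that connect the entities to be extracted are taken automatically from a small training corpus as extraction patterns, and for generalization these patterns are further divided into components whose combinations are then used. As a real-world application, the thesis builds an information extraction system that extracts protein-protein interactions, which are considered important in biomedical research. The effectiveness of the proposed method is shown by evaluating this system on a small test corpus and on large-scale real-world text.

The thesis also discusses issues that must be considered to ensure that a deep parser is effective for information extraction in a specific domain. The first issue is whether the deep syntactic relations obtained by a deep parser can adequately represent the syntactic information needed to construct extraction patterns. To show that deep parsing is sufficiently accurate on biomedical documents, the thesis evaluates the precision of primitive PASs obtained from biomedical documents by an HPSG parser. To compare part-of-speech patterns with PAS-based patterns, it also compares and evaluates verb-argument relations extracted from biomedical papers using part-of-speech patterns and using PAS patterns. The second issue concerns applying a parser developed for general purposes to a specific domain. To measure the coverage of an HPSG grammar developed for general purposes, we investigated what deficiencies the grammar has when parsing biomedical papers. We also show that the parsing accuracy of general-purpose parsers on a general-domain corpus does not guarantee parsing accuracy or information extraction accuracy on biomedical papers.

Throughout the thesis, we show that deep parsing is effective for information extraction. To obtain better performance in domain-specific information extraction, local information in the sentence, such as surface words, needs to be used together with deep parsing results. In addition, when developing a deep parser for use in real information extraction tasks, corpora specific to the target domain need to be used as well as general-purpose corpora.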
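
The normalization idea in the abstract can be illustrated with a minimal sketch. It is not the dissertation's system: no HPSG parser is run, the PASs below are written by hand to stand in for parser output, and the names `PAS`, `PARSED`, `PATTERN_PREDICATES`, and `extract_interaction` are hypothetical. It only shows why one pattern over a normalized predicate-argument structure can cover both an active and a passive sentence.

```python
from typing import NamedTuple, Optional, Set, Tuple


class PAS(NamedTuple):
    """Toy predicate-argument structure: a predicate lemma and two argument slots."""
    predicate: str
    arg1: str  # logical subject
    arg2: str  # logical object


# Hand-written PASs standing in for parser output (assumption: a real system
# would obtain these from a full parser). The active and passive sentences
# normalize to the same structure.
PARSED = {
    "ProtA activates ProtB.": PAS("activate", "ProtA", "ProtB"),
    "ProtB is activated by ProtA.": PAS("activate", "ProtA", "ProtB"),
    "ProtA was purified from yeast.": PAS("purify", "UNKNOWN", "ProtA"),
}

# Hypothetical set of predicates treated as interaction patterns.
PATTERN_PREDICATES = {"activate", "bind", "interact"}


def extract_interaction(
    pas: PAS, proteins: Set[str]
) -> Optional[Tuple[str, str, str]]:
    """Return (agent, predicate, target) if the PAS links two known proteins."""
    if (
        pas.predicate in PATTERN_PREDICATES
        and pas.arg1 in proteins
        and pas.arg2 in proteins
    ):
        return (pas.arg1, pas.predicate, pas.arg2)
    return None


if __name__ == "__main__":
    known = {"ProtA", "ProtB"}
    for sentence, pas in PARSED.items():
        print(sentence, "->", extract_interaction(pas, known))
```

Because the match is made on the normalized structure rather than on surface word order, the single pattern does not need separate variants for active and passive voice. Dividing patterns into components and recombining them, as the abstract describes, is not covered by this sketch.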

Similar resources

Integrating Syntactic and Semantic Analysis into the Open Information Extraction Paradigm

In this paper we present an approach aimed at enriching the Open Information Extraction paradigm with semantic relation ontologization by integrating syntactic and semantic features into its workflow. To achieve this goal, we combine deep syntactic analysis and distributional semantics using a shortest path kernel method and soft clustering. The output of our system is a set of automatically di...

Large-Scale Information Extraction from Textual Definitions through Deep Syntactic and Semantic Analysis

We present DEFIE, an approach to large-scale Information Extraction (IE) based on a syntactic-semantic analysis of textual definitions. Given a large corpus of definitions we leverage syntactic dependencies to reduce data sparsity, then disambiguate the arguments and content words of the relation strings, and finally exploit the resulting information to organize the acquired relations hierarchic...

Exploring syntactic structured features over parse trees for relation extraction using kernel methods

Extracting semantic relationships between entities from text documents is challenging in information extraction and important for deep information processing and management. This paper proposes to use the convolution kernel over parse trees together with support vector machines to model syntactic structured information for relation extraction. Compared with linear kernels, tree kernels can effe...

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...

Extracting Relations with Integrated Information Using Kernel Methods

Entity relation detection is a form of information extraction that finds predefined relations between pairs of entities in text. This paper describes a relation detection approach that combines clues from different levels of syntactic processing using kernel methods. Information from three different levels of processing is considered: tokenization, sentence parsing and deep dependency analysis....

FBK-IRST: Kernel Methods for Semantic Relation Extraction

We present an approach for semantic relation extraction between nominals that combines shallow and deep syntactic processing and semantic information using kernel methods. Two information sources are considered: (i) the whole sentence where the relation appears, and (ii) WordNet synsets and hypernymy relations of the candidate nominals. Each source of information is represented by kernel functi...

Publication date: 2006